Confidence Decision Trees via Online and Active Learning for Streaming Data

Authors

  • Rocco De Rosa
  • Nicolò Cesa-Bianchi
Abstract

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify splitting a leaf. Although some of the issues in the statistical analysis of Hoeffding trees have already been clarified, a general and rigorous study of confidence intervals for splitting criteria is missing. We fill this gap by deriving accurate confidence intervals to estimate the splitting gain in decision tree learning with respect to three criteria: entropy, Gini index, and a third index proposed by Kearns and Mansour. We also extend our confidence analysis to a selective sampling setting, in which the decision tree learner adaptively decides which labels to query in the stream. We provide theoretical guarantees bounding the probability that the decision tree learned via our selective sampling strategy suboptimally classifies the next example in the stream. Experiments on real and synthetic data in a streaming setting show that our trees are indeed more accurate than trees with the same number of leaves generated by state-of-the-art techniques. In addition, our active learning module empirically uses fewer labels without significantly hurting performance.
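
To make the idea of a confidence-based split decision concrete, the following is a minimal Python sketch of the classical Hoeffding-tree style test that the abstract refers to. It is not the refined confidence intervals derived in the paper, and every function name and parameter here (`split_gain`, `hoeffding_bound`, `should_split`, `value_range`, `delta`) is a hypothetical choice made for illustration. It assumes the usual rule: split a leaf once the observed gain of the best attribute exceeds that of the runner-up by more than the Hoeffding deviation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(parent, left, right, impurity):
    """Impurity reduction obtained by splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def hoeffding_bound(value_range, delta, n):
    """Hoeffding deviation for the mean of n values spanning `value_range`,
    holding with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Assumed classical rule: split when the gap between the two best
    gains exceeds the Hoeffding deviation after n examples."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

For example, with `delta = 0.05` and 1000 examples at a leaf, `hoeffding_bound(1.0, 0.05, 1000)` is roughly 0.039, so a gain gap of 0.10 would trigger a split under this rule while a gap of 0.02 would not. The paper's contribution, as the abstract states, is to replace this generic deviation with intervals tailored to each splitting criterion.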

Similar resources

Confidence Decision Trees via Online and Active Learning for Streaming (BIG) Data

Decision tree classifiers are a widely used tool in data stream mining. The use of confidence intervals to estimate the gain associated with each split leads to very effective methods, like the popular Hoeffding tree algorithm. From a statistical viewpoint, the analysis of decision tree classifiers in a streaming setting requires knowing when enough new information has been collected to justify...

Online tree-based ensembles and option trees for regression on evolving data streams

The emergence of ubiquitous sources of streaming data has given rise to the popularity of algorithms for online machine learning. In that context, Hoeffding trees represent the state-of-the-art algorithms for online classification. Their popularity stems in large part from their ability to process large quantities of data with a speed that goes beyond the processing power of any other streaming...

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

On the Use of Provalets in a Predictive Maintenance Use Case

In this paper we report on a predictive maintenance use case using Provalet rule agents for implementing expressive rule-based streaming analytics and decision logic on top of online machine learning prediction models, which are dynamically applied to the streaming data coming from on-board asset monitoring sensors. Provalets are component-based mobile agents for rule-based inference analytics...

Investigating Students' Use of Lecture Videos in Online Courses: A Case Study for Understanding Learning Behaviors via Data Mining

This study investigated students’ learning behaviors in a fully online psychology course which offered 76 streaming lecture videos and supplementary resources, as well as individual and group activities. This paper focuses on students’ use of lecture videos. Data collection included students’ real usage of data on Blackboard Learn 9.1, a course survey, and students’ final grades. The analysis a...

Journal:
  • J. Artif. Intell. Res.

Volume 60

Publication date: 2017